Abstract

This paper presents our contribution to the ChaLearn Challenge 2015
on Cultural Event Classification. The challenge in this task is to
automatically classify images from 50 different cultural events. Our
solution is based on the combination of visual features extracted from
convolutional neural networks with temporal information using a
hierarchical classifier scheme. We extract visual features from the
last three fully connected layers of both CaffeNet (pre-trained with
ImageNet) and our fine-tuned version for the ChaLearn challenge. We
propose a late fusion strategy that trains a separate low-level SVM on
each of the extracted neural codes. The class predictions of the
low-level SVMs form the input to a higher-level SVM, which gives the
final event scores. We achieve our best result by adding a temporal
refinement step into our classification scheme, which is applied
directly to the output of each low-level SVM. Our approach penalizes
high classification scores based on visual features when their time
stamp does not match well an event-specific temporal distribution
learned from the training and validation data. Our system achieved the
second best result in the ChaLearn Challenge 2015 on Cultural Event
Classification with a mean average precision of 0.767 on the test set.


1. Motivation 

Cultural heritage is broadly considered a value to be preserved
through generations. From small town museums to worldwide
organizations like UNESCO, all of them aim at keeping, studying and
promoting the value of culture. Their professionals are traditionally
interested in accessing large amounts of multimedia data through rich
queries which can benefit from image processing techniques. For
example, one of the first visual search engines ever, IBM’s QBIC [3],
was showcased for painting retrieval from the Hermitage Museum in
Saint Petersburg (Russia).

Social events are a cultural expression that is typically not found
in a museum. Over the years, every society has created collective
cultural events celebrated with a certain temporal periodicity,
commonly yearly. These festivities may be widely spread
geographically, like the Chinese New Year or the Indian Holi
Festival, or much more localized, like the Carnival in Rio de Janeiro
or the Castellers (human towers) in Catalonia. An image example for
each of these four cultural events is presented in Figure 1. All of
them have a deep cultural and identity nature that motivates a large
number of people to repeat very particular behavioral patterns.

The study and promotion of such events has also benefited from the
technological advances that have popularized the acquisition, storage
and distribution of large amounts of multimedia data. Cultural events
across the globe are just a click away, which improves the access of
culture lovers to rich visual documents and also increases the
touristic power of these events or even their exportation to new
geographical areas.

Figure 1. Examples of images depicting cultural events.

However, as in any classic multimedia retrieval problem, while the
acquisition and storage of visual content is a popular practice among
event attendees, its proper annotation is not. While both personal
collections and public repositories contain a growing amount of visual
data about cultural events, most of it is not easily available due to
the almost non-existent semantic metadata. Only a minority of photo
and video uploaders will add the simplest form of annotation, a tag or
a title, while most users will just store their visual content with no
further processing. Current solutions mostly rely on temporal and
geolocation metadata attached by the capture devices, but these
sources are also unreliable for different reasons, such as an
erroneous setup of the internal clock of the cameras, or the metadata
removal policy applied in many photo sharing sites to guarantee
privacy.

Cultural event recognition is a challenging retrieval task because of
its strong semantic dimension. The goal of cultural event recognition
is not only to find images with similar content, but further to find
images that are semantically related to a particular type of event.
Images of the same cultural event may also be visually different.
Thus, major research questions in this context are (i) whether
content-based features are able to represent the cultural dimension of
an event and (ii) whether robust visual models for cultural events can
be learned from a given set of images.

In our work, we addressed the cultural event recognition problem in
photos by combining the visual features extracted from convolutional
neural networks (convnets) with metadata (time stamps) of the photos
in the hierarchical fusion scheme shown in Figure 2. The main
contributions of our paper are:

• Late fusion of the neural codes from both the fine-tuned and
non-fine-tuned fully connected layers of the CaffeNet convnet.

• Generation of spline-based temporal models for cultural events
based on photo metadata crawled from the web.

• Temporal event modeling to refine visual-based classification as
well as noisy data augmentation.

This paper is structured as follows. Section 2 overviews the related
work, especially in the field of social event detection and
classification. Section 3 describes a temporal modeling of the
cultural events which has been applied both on the image
classification and data augmentation strategies presented in Section 4
and Section 5, respectively. Experiments on the ChaLearn Cultural
Event Dataset are reported in Section 6 and conclusions are drawn in
Section 7.

This work was awarded the 2nd prize in the ChaLearn Challenge 2015 on
Cultural Event Classification. Our source code, features and models
are publicly available online^1.

2. Related work 

Automatic event recognition on photo and video collections has been
broadly addressed from a multimedia perspective that goes beyond the
purely visual one. Typically, visual content is accompanied by
descriptive metadata such as a time stamp from the camera or an
uploading site, a geolocation from a GPS receiver, or some text in the
form of a tag, a title or a description. This additional contextual
data for a photo is highly informative for recognizing the depicted
semantics.

Previous work on social events has shown that temporal information
provides strong clues for event clustering [27]. In the context of
cultural event recognition, we consider temporal information a rather
“asymmetric clue”, where time is an indicator that serves to reject a
given hypothesis rather than to support it. On the one hand, given a
prediction (e.g. based on visual information) for a photo for a
particular event, we can use temporal information, i.e. the capture
date of the photo, to easily reject this hypothesis if the capture
date does not coincide with the predicted event. In this case temporal
information represents a strong clue. On the other hand, cultural
events may take place at the same time. As a consequence, the
coincidence of a capture date with the predicted event represents just
a weak clue. We take this “asymmetric nature” into account in our
temporal refinement scheme (see Section 4.3).

Temporal information has further been exploited for event
classification by Mattivi et al. The authors define a two-level
hierarchy of events and sub-events which are automatically classified
based on their visual information described as a Bag of Visual Words.
All photos are first classified visually. Next, the authors refine the
classification by enforcing temporal coherence in the classification
for each event and sub-event, which considerably improved the purely
visual classification.

A similar approach is applied by Bossard et al., exploiting temporal
information to define events as a sequence of sub-events. The authors
exploit the temporal ordering of photos and model events as a series
of sub-events by a Hidden Markov Model (HMM) to improve the
classification.

A very similar problem to Cultural Event Recognition, namely “Social
Event Classification”, was formulated in the MediaEval Social Event
Detection benchmark in 2013 [21, 20]. The provided dataset contained
57,165 images from Instagram together with available contextual
metadata (time, location and tags) provided by the API. The
classification task considered a first decision level between event
and non-event and, in the case of event, eight semantic classes were
defined to be distinguished: concert, conference, exhibition, fashion,
protest, sports, theatre/dance, other. The results over all
participants showed that the classification performance strongly
benefits from multimodal processing combining content and contextual
information. Pure contextual processing, as proposed in two of the
submissions, yielded the weakest results. The remaining participants
proposed to add visual analysis to the contextual processing.
CERTH-ITI [23] combined pLSA on the 1,000 most frequent tags with a
dense sampling of SIFT visual features, which were later coded with
VLAD. They observed a complementary role between visual and textual
modalities. Brenner and Izquierdo combined textual features with the
global GIST visual descriptor, which is capable of capturing the
spatial composition of the scene. The best performance in the Social
Event Classification task was achieved by a team that combined the
processing of textual photo descriptions with earlier work on visual
processing, based on bags of visual words aggregated in different
fashions through events. Their results showed that visual information
is the best option to discriminate between event and non-event, and
that textual information is more reliable to discriminate between
different event types.

^1 https://imatge.upc.edu/web/resources/cultural-event-recognition-computer-vision-software

Figure 2. Global architecture of the proposed system.


In terms of benchmarking, a popular strategy is to retrieve additional
data to extend the training dataset. The authors of one such system,
for example, retrieved images from Flickr to build unigram language
models of the requested event types and locations in order to enable a
more robust matching with the user-provided query. We explored a
similar approach for cultural event recognition (see Section 5.2).
Our results, however, showed that extending the training set in this
way did not improve performance but actually made it worse.


3. Temporal models 

Cultural events usually occur on a regular basis and thus have a
repetitive nature. For example, “St. Patrick’s day” always takes place
on March 17, “La Tomatina” is always scheduled for the last week of
August, and the “Carnival of Rio” usually takes place at some time in
February and lasts for one week. More complex temporal patterns exist,
for example, for cultural events coupled to the lunar calendar, whose
dates change slightly each year. An example is the “Maslenitsa” event
in Russia, which is scheduled for the eighth week before Eastern
Orthodox Easter.

The temporal patterns associated with cultural events are a valuable
clue for their recognition. A photo captured, for example, in December
is very unlikely (except for erroneous date information) to show a
celebration of St. Patrick’s day. While temporal information alone is
not sufficient to assign the correct event (many events may take place
concurrently), we hypothesize that temporal information provides
strong clues that can improve cultural event recognition.

Figure 3. Temporal spline models over the days of the year for (a)
the “Maslenitsa” and (b) the “Timkat” event: (a) for normally
distributed data the model becomes approximately Gaussian-shaped; (b)
the uncertainty of the distribution is reflected in the temporal
model.

Before any temporal processing, temporal models have to be extracted
from the data. Temporal models for cultural events can be either
generated manually in advance or extracted automatically from metadata
of related media. We propose a fully automatic approach to extract
temporal models for cultural events. The input to our approach is a
set of capture dates for media items that are related to a given
event. Capture dates may be, for example, extracted from social media
sites like Flickr or from the metadata embedded in the photos (e.g.
EXIF information). In a first step, we extract the day and month of
the capture dates and convert them into a number $d$ between 1 and
365, encoding the day in the year when the photo was taken. From these
numbers, we compute a temporal distribution $t(d)$ of all available
capture dates. Assuming that a cultural event takes place annually, it
is straightforward to model the temporal distribution with a Gaussian
model. Gaussian modeling works well when a sufficient number of
timestamps exists. For sparse data, however, with only a few
timestamps, the distribution is likely to become non-Gaussian and thus
model fitting fails to generate accurate models. Additionally, the
timestamps of photos are often erroneous (or overwritten by certain
applications), yielding strong deviations from the ideal distribution.
To take the variability that is present in the data into account, a
more flexible model is required. We model the distribution $t(d)$ by a
piecewise cubic smoothing spline. To generate the final model $T$, we
evaluate the spline over the entire temporal domain and normalize it
between 0 and 1. Given a photo $i$ with a certain timestamp $d_i$, the
fitted temporal model $T_c(d_i)$ provides a score $s_c$ that the photo
refers to the associated event $c$. The flexible spline model enables
the modeling of sparse and non-Gaussian distributions, as well as of
events whose occurrence patterns are more complex than annual.
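The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cubic smoothing spline is replaced here by a circular moving average over the day-of-year histogram, purely to keep the example dependency-free. The function names, the `window` parameter, the reference year 2015 (non-leap, so February 29 is not handled) and the capture dates are all assumptions made for illustration.

```python
from datetime import date


def day_of_year(month, day):
    """Map a (month, day) pair to d in 1..365 using a non-leap reference year."""
    return date(2015, month, day).timetuple().tm_yday


def temporal_model(days, window=3):
    """Build a temporal model T over days 1..365, normalized to [0, 1].

    Stand-in for the smoothing spline: a circular moving average of the
    day-of-year histogram t(d), rescaled between 0 and 1.
    """
    hist = [0.0] * 365
    for d in days:
        hist[d - 1] += 1.0
    half = window // 2
    smooth = [
        sum(hist[(i + k) % 365] for k in range(-half, half + 1)) / (2 * half + 1)
        for i in range(365)
    ]
    lo, hi = min(smooth), max(smooth)
    if hi == lo:
        return [0.0] * 365  # no usable temporal information
    return [(v - lo) / (hi - lo) for v in smooth]


# Hypothetical capture dates clustered around January 19, plus one outlier.
dates = [(1, 18), (1, 19), (1, 19), (1, 19), (1, 20), (7, 4)]
T = temporal_model([day_of_year(m, d) for m, d in dates])
peak_day = T.index(max(T)) + 1  # day of the year with the highest score
```

The outlier in July keeps a low but non-zero score, which mirrors the behaviour described for sparse or noisy timestamps.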

Figure 3 shows temporal models for two example events. The
“Maslenitsa” (Figure 3(a)) takes place between mid-February and
mid-March (approx. days 46-74). This corresponds well with the
timestamps extracted from the related media items, resulting in a near
Gaussian-shaped model. The “Timkat” event always takes place on
January 19. This is accurately detected by the model, which has its
peak at day 19. The photos related to this event, however, have
timestamps that are distributed across the entire year. This property
of the underlying data is reflected in the model, giving low but
non-zero scores to photos with timestamps other than the actual event
date.

Figure 4 shows the temporal models extracted from the training and
validation data for all 50 classes. We observe that each model (row)
exhibits one strong peak which represents the most likely date of the
event. Some models contain additional smaller side-peaks which reflect
the uncertainty contained in the training data. The events are
distributed over the entire year, and some events occur at the same
time.
Figure 4. Automatically generated temporal models for each event
class. For each event we observe a typical pattern of recording 
dates exhibiting one strong peak. The colors range from dark blue 
(0) to red (1). 


The generated temporal models can be used to refine decisions made
during classification (see Section 4.3) as well as for the filtering
of additional data collections to reduce noise in the training data
(see Section 5.2).


4. Image Classification 

The automatic recognition of a cultural event from a photo is
addressed in this paper with the system architecture presented in
Figure 2. We propose combining the visual features obtained at the
fully connected layers of two versions of the same CaffeNet
convolutional neural network: the original one and a modified version
fine-tuned with photos captured at cultural events. A low-level SVM
classifier is trained for each visual feature, and its scores are
refined with the temporal model described in Section 3. Finally, the
temporally modified classification scores are fused in a final
high-level SVM to obtain the final classification for a given test
image.

4.1. Feature extraction 

Deep convolutional neural networks (convnets) have recently become
popular in computer vision, since they have dramatically advanced the
state of the art in tasks such as image classification, retrieval or
object detection.

Convnets are typically defined as a hierarchical structure of a
repetitive pattern of three hidden layers: (a) a local convolutional
filtering (bidimensional in the case of images), (b) a non-linear
operation (commonly Rectified Linear Units, ReLU) and (c) a spatial
local pooling (typically a max operator). The resulting data structure
is called a feature map and, in the case of images, it corresponds to
a 2D signal. The deepest layers in the convnet do not follow this
pattern anymore but consist of fully connected (FC) layers: every
value (neuron) in a fully connected layer is connected to all neurons
from the previous layer through some weights. As these fully connected
layers do not apply any spatial constraint anymore, they are
represented as one-dimensional vectors, referred to in this paper as
neural codes.

The number of layers is a design parameter that, in the literature,
may vary from three to nineteen [24]. Some studies indicate that the
first layers capture finer patterns, while the deeper the level, the
more complex the patterns that are modeled. However, there is no clear
answer yet about how to find the optimal architecture to solve a
particular visual recognition problem. The design of convnets is still
mainly based on a trial-and-error process and the expertise of the
designer. In our work we have adopted the public implementation of
CaffeNet, which was inspired by AlexNet. This convnet is defined by 8
layers, the last 3 of which are fully connected. In our work we have
considered the neural codes in these layers (FC6, FC7 and FC8) to
visually represent the image content.

Apart from defining a convnet architecture, it is necessary to learn
the parameters that govern the behaviour of the filters in each layer.
These parameters are obtained through a learning process that replaces
the classic handcrafted design of visual features. This way, the
visual features are optimized for the specific problem that one wants
to solve. Training a convnet is achieved through backpropagation, a
computationally intensive process that has recently been made
practical by the affordable cost of GPUs. In addition to the
computational requirements, a large amount of annotated data is also
necessary. Similarly to the strategy adopted in the design of the
convnet, we have also used the publicly available filter parameters of
CaffeNet, which had been trained for 1,000 semantic classes from the
ImageNet dataset.

The cultural event recognition dataset addressed in this paper is
different from the one used to train CaffeNet, both in the type of
images and in the classification labels. In addition, the number of
photos of annotated cultural events available in this work is much
smaller than the large number of images available in ImageNet. We have
addressed this situation by also considering the possibility of
fine-tuning CaffeNet, that is, providing additional training data to
an existing convnet which had been trained for a similar problem. This
way, the network parameters are not randomly initialized, as in
training from scratch, but are already adjusted to a solution which is
assumed to be similar to the desired one. Previous works have shown
that fine-tuning is an efficient and valid solution to address this
type of situation. In the experiments reported in Section 6 we have
used feature vectors from both the original CaffeNet and its
fine-tuned version.

4.2. Hierarchical fusion 

The classification approach applied in our work uses the neural codes
extracted from the convnets as features to train a classifier (Support
Vector Machines, SVMs, in our case), as proposed in previous work. As
we do not know a priori which network layers are most suitable for our
task, we decided to combine several layers using a late fusion
strategy.

The neural codes obtained from different networks and different layers
may have strongly different dimensionality (e.g. from 4,096 to 50 in
our setup). During the fusion of these features we have to take care
that features with higher dimensionality do not dominate the features
with lower dimensionality. Thus, we adopted a hierarchical
classification scheme to late fuse the information from the different
features in a balanced way.

At the lower level of the hierarchy we train separate multi-class SVMs
(using a one-against-one strategy) for each type of neural code. We
discard the final predictions of the SVM and retrieve the
probabilities of each sample for each class. The probabilities
obtained by all lower-level SVMs form the input to the higher
hierarchy level.

The higher hierarchy level consists of an SVM that takes the
probabilistic output of the lower-level SVMs as input. This ensures
that all input features are weighted equally in the final decision
step. The higher-level SVM is trained directly from the probabilities
and outputs a prediction for the most likely event. Again we discard
the hard class prediction and retrieve the probabilities for each
event as the final output.
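To make the balancing argument concrete, the following minimal Python sketch shows how the input of the higher-level classifier could be assembled. Only the stacking step is shown; the SVM training itself is omitted. The function name `stack_probabilities` and the uniform toy probabilities are assumptions for illustration.

```python
def stack_probabilities(per_code_probs):
    """Concatenate the class-probability vectors of the low-level classifiers.

    per_code_probs: one entry per neural code, each a list of C class
    probabilities for the same image.
    Returns a single vector of length (number of codes) * C.
    """
    num_classes = len(per_code_probs[0])
    for probs in per_code_probs:
        if len(probs) != num_classes:
            raise ValueError("all low-level outputs must have C entries")
    return [p for probs in per_code_probs for p in probs]


# Six neural codes (FC6, FC7, FC8 from two networks) and C = 50 classes.
# Whatever the dimensionality of each code (4,096, 1,000 or 50), every
# low-level classifier contributes exactly 50 values to the fused vector.
per_code = [[1.0 / 50] * 50 for _ in range(6)]
fused = stack_probabilities(per_code)  # input vector for the high-level SVM
```

Because every neural code is reduced to the same number of class probabilities before fusion, a 4,096-dimensional code cannot outweigh a 50-dimensional one at the higher level.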

4.3. Temporal Refinement 

While visual features can easily be extracted from each image, the
availability of temporal information depends on the existence of
suitable metadata. Thus, temporal information must in general be
considered to be a sparsely available feature. Due to its sparse
nature, we propose to integrate temporal information into the
classification process by refining the classifier outputs. This allows
us to selectively incorporate the information only for those images
where temporal information is available.

The basis for temporal refinement is the set of temporal models
introduced in Section 3. The models $T_c$, with $c = 1, \dots, C$ and
$C$ the number of classes, represent for each event class $c$ and each
day of the year $d$ a score $s$ representing the probability of a
photo captured on a given day to belong to the event: $s = T_c(d)$.
For a given image with index $i$, we first extract the day of the year
$d_i$ from its capture date and use it as an index to retrieve the
scores from the temporal models of all event classes:
$s_c = T_c(d_i)$, with $c = 1, \dots, C$.
Given a set of probabilities $\mathbf{P}_i$ for image $i$ obtained
from a classifier, the refinement of these probabilities is performed
as follows. First, we compute the difference between the probabilities
and the temporal scores: $\mathbf{d}_i = \mathbf{P}_i - \mathbf{s}$
(bold symbols denote vectors over the $C$ classes; $\mathbf{d}_i$ is
not to be confused with the day of the year $d_i$). Next, we
distinguish between two different cases:

(I) $\mathbf{d}_i(c) < 0$: Negative differences mean that the
probability for a given class predicted by the classifier is less than
the temporal score for this class. This case may easily happen as
several events may occur at the same time as the photo was taken. The
temporal models indicate that several events may be likely. Thus, the
temporal information provides only a weak clue that is not
discriminative. To handle this case, we decide to trust the class
probabilities by the classifier and to ignore the temporal scores by
setting $\mathbf{d}_i = \max(\mathbf{d}_i, 0)$.

(II) $\mathbf{d}_i(c) > 0$: In this case the temporal score is lower
than the estimate of the classifier. Here, the temporal score provides
a strong clue that indicates an inaccurate prediction of the
classifier. In this case, we use the difference $\mathbf{d}_i(c)$ to
re-weight the class probability.

The weights $\mathbf{w}_i$ are defined as
$\mathbf{w}_i = \max(\mathbf{d}_i, 0) + 1$. The final re-weighting of
the probabilities $\mathbf{P}_i$ is performed by computing
$\mathbf{P}_i = \mathbf{P}_i / \mathbf{w}_i$ (element-wise). In case
(I) the temporal scores do not change the original predictions of the
classifier. In case (II) the scores are penalized by a fraction that
is proportional to the disagreement between the temporal scores and
the prediction of the classifier.
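The refinement rule can be written in a few lines. The following Python sketch applies it to one image; the function name `temporal_refinement` and the toy numbers are illustrative assumptions, and images without a timestamp would simply skip this step.

```python
def temporal_refinement(probs, temporal_scores):
    """Penalize class probabilities that disagree with the temporal model.

    probs: classifier probabilities for one image, one value per class.
    temporal_scores: temporal scores s_c = T_c(d_i), one value per class.
    """
    refined = []
    for p, s in zip(probs, temporal_scores):
        diff = max(p - s, 0.0)  # case (I): negative differences are ignored
        weight = diff + 1.0     # w >= 1, equal to 1 when the model agrees
        refined.append(p / weight)
    return refined


# Toy example with three classes. The classifier favours class 0, but the
# capture date is very unlikely for that event (temporal score 0.0).
probs = [0.8, 0.1, 0.1]
scores = [0.0, 0.9, 0.3]
refined = temporal_refinement(probs, scores)
```

In this example only the first class is penalized (0.8 is divided by 1.8), while the other two probabilities are left untouched because their temporal scores are not lower than the classifier estimates.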

5. Data Augmentation 

The experiments described in Section 6 were conducted with the
ChaLearn Cultural Event Recognition dataset, which was created by
downloading photos from the Google Images and Bing search engines.
Previous works [16, 28, 6] have reported gains when applying some sort
of data augmentation strategy.

We have explored two paths for data augmentation: artificial
transformations on the test images and an extension of the training
dataset by downloading additional data from Flickr.

5.1. Image transformations 

A simple and classic method for data augmentation is to artificially
generate transformations of the test image and fuse the classification
scores obtained in each transformation. We adopted the default image
transformations associated with CaffeNet, that is, a horizontal
mirroring and 5 crops of the input image (four corners and center).
The resulting neural codes associated with each fully connected layer
were fused by averaging the 10 feature vectors generated with the 10
image transformations.
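The ten views and their fusion can be sketched in plain Python as follows. This is not the Caffe implementation: images are represented as nested lists, the helper names `ten_views` and `average_codes` are hypothetical, and the "neural code" of each view is replaced by its flattened pixels only to keep the example runnable.

```python
def ten_views(image, crop):
    """Return 10 views: 4 corner crops, 1 center crop, and their mirrors.

    image: 2D list (rows of pixel values); crop: side of the square crop.
    """
    h, w = len(image), len(image[0])
    tops = [0, 0, h - crop, h - crop, (h - crop) // 2]
    lefts = [0, w - crop, 0, w - crop, (w - crop) // 2]
    views = []
    for top, left in zip(tops, lefts):
        patch = [row[left:left + crop] for row in image[top:top + crop]]
        views.append(patch)
        views.append([row[::-1] for row in patch])  # horizontal mirror
    return views


def average_codes(codes):
    """Element-wise mean of equally sized feature vectors."""
    n = len(codes)
    return [sum(values) / n for values in zip(*codes)]


# Toy 4x4 "image" and 2x2 crops; flattened pixels stand in for neural codes.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
views = ten_views(image, 2)
codes = [[float(v) for row in view for v in row] for view in views]
mean_code = average_codes(codes)
```

In the actual system each view would be passed through the convnet, and the averaging would be applied separately to the FC6, FC7 and FC8 outputs.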

5.2. External data download 

We decided to extend the amount of training data to fine-tune our
convnet, as discussed in Section 4.1. By doing this, we expected to
reduce the generalization error of the learned model by having
examples coming from a wider range of sources.

The creators of the ChaLearn Cultural Event Recognition dataset
described each of the 50 considered events with pairs of title and
geographical location, such as Carnival Rio-Brazil, Obon-Japan or
Harbin Ice and Snow Festival-China. This information allows generating
queries on other databases to obtain an additional set of labeled
data.

Our chosen source for the augmented data was the Flickr photo
repository. Its public API allows querying its large database of
photos and filtering the obtained results by tags, textual data search
and geographical location. We generated 3 sets of images from Flickr,
each of them introducing a higher degree of refinement:

90k set: Around 90,000 photos retrieved by matching the provided
event title on the Flickr tags and content metadata fields.

21k set: The query from the 90k set was combined with a GPS filtering
based on the provided country.

9k set: The query from the 21k set was further refined with manually
selected terms from the Wikipedia articles related to the event. In
addition, the Flickr query also toggled on an interestingness flag,
which improved the diversity of images in terms of users and dates.
Otherwise, Flickr would provide a list sorted by upload date, which
would probably contain many similar images from a reduced set of
users.

The temporal models $T_c$ presented in Section 3 were also used to
improve the likelihood that a downloaded photo actually belongs to a
certain event. Given a media item $i$ retrieved for a given event
class $c$, we extract the day of capture $d_i$ from its metadata and
retrieve the score $s_c = T_c(d_i)$ from the respective temporal
model. Next, we threshold the score to remove items that are unlikely
under the temporal model. To ensure a high precision of the filtered
media collection, the threshold should be set to a rather high value,
e.g. 0.9. Figure 5 gives two examples of media collections retrieved
for particular events. We provide the distribution of capture dates
with the pre-trained temporal models.
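The threshold-based filtering can be sketched as follows in Python. The function name `filter_by_temporal_score`, the toy temporal model and the item identifiers are hypothetical; discarding items without a capture date is also an assumption of this sketch, since the text does not state how such items were handled.

```python
def filter_by_temporal_score(items, model, threshold=0.9):
    """Keep items whose capture day scores at least `threshold`.

    items: list of (item_id, day_of_year) pairs, with day in 1..365, or
           None when no capture date is available.
    model: list of 365 scores in [0, 1] for one event class.
    """
    kept = []
    for item_id, day in items:
        if day is not None and model[day - 1] >= threshold:
            kept.append(item_id)
    return kept


# Toy temporal model peaking at day 19 (January 19).
model = [0.05] * 365
model[17], model[18], model[19] = 0.95, 1.0, 0.5

items = [("a", 19), ("b", 18), ("c", 20), ("d", 200), ("e", None)]
kept = filter_by_temporal_score(items, model)
```

A high threshold such as 0.9 keeps only the items captured very close to the peak of the model, trading recall for precision as described above.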

The Flickr IDs of this augmented dataset filtered by minimum temporal
scores have been published in JSON format at the URL indicated in
Section 1.

6. Experiments 

6.1. Cultural Event Recognition dataset 

The Cultural Event Recognition dataset depicts 50 important cultural
events all over the world. In all the image categories, garments,
human poses, objects and context constitute the possible cues to be
exploited for recognizing the events, while preserving the inherent
inter- and intra-class variability of this type of images. The dataset
is divided into three partitions: 5,875 images for training, 2,332 for
validation and 3,569 for test.





Initial version of the paper accepted at the CVPR Workshop ChaLearn Looking at People 2015



Figure 5. Two examples of retrieved image collections from Flickr and
their temporal distribution over the days of the year: (a) Desfile de
Silleteros, where the retrieved images match well the pre-trained
temporal model; (b) Carnival of Venice, where the temporal
distribution shows numerous outliers which are considered unlikely
given the temporal model. The proposed threshold-based filtering
removes those items.


6.2. Experimental setup 

We employ two different convnets as input (see Section 4.1): the
original CaffeNet trained on 1,000 ImageNet classes, and a fine-tuned
version of CaffeNet trained during 60 epochs on the 50 classes defined
in the ChaLearn Cultural Event Recognition dataset. Fine-tuning of the
convnet was performed in two stages: in the first stage, the training
partition was used to train and the validation partition was used to
estimate the loss while the network learned. In the second stage, the
two partitions were switched so that the network could learn the
features from all the available labeled data.

From both convnets we extracted neural codes from layers FC6 and FC7
(each of 4,096 dimensions), as well as FC8 (the top layer with a
softmax classifier), which has 1,000 dimensions for the original
CaffeNet and 50 for the fine-tuned network. Both feature extraction
and fine-tuning have been performed using the Caffe deep learning
framework.

As presented in Section 4.2, a classifier was trained for each of the
6 neural codes, in addition to the one used for late fusion. We used
the linear SVM implementation of the LIBSVM library, with parameter
C = 1 determined by cross validation and grid search, and with
probabilistic output switched on.

Each image was scored for each of the 50 considered cul¬ 
tural events and results were measured by a precision/recall 
curve, whose area under the curve was used to estimate the 
average precision (AP). Numerical results are averaged over 
the 50 events to obtain the mean average precision (mAP). 
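As an illustration of the metric, the following Python sketch computes the average precision of one event from a ranked list and averages it over events. It uses the common non-interpolated definition (mean of the precision values at the rank of each positive image), which is an assumption: the exact interpolation applied by the challenge organizers is not specified here. The function names and toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_class):
    """per_class: list of (scores, labels) pairs, one pair per event."""
    aps = [average_precision(s, l) for s, l in per_class]
    return sum(aps) / len(aps)


# Toy example with two events and four images each.
event_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])  # AP = (1/1 + 2/3) / 2
event_b = ([0.2, 0.9, 0.4, 0.1], [0, 1, 0, 0])  # AP = 1.0
m_ap = mean_average_precision([event_a, event_b])
```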



                          FC6      FC7      FC8
Raw layer                 0.6832   0.6669   0.6079
+ temporal refinement     0.6893   0.6730   0.6152

Table 1. Results (mAP) on single layer raw neural codes.


Configuration                 mAP
Fine-tuned FC6-FC7-FC8        0.6919
+ raw FC6-FC7-FC8             0.7038
+ temporal refinement         0.7357

Table 2. Results on fine-tuned and fused multi-layer codes.


More details about the evaluation process are provided by the
challenge organizers.


6.3. Results on the validation dataset 

A first experiment was performed to assess the impact of temporal
refinement on the default CaffeNet, that is, with no fine-tuning. The
results in Table 1 indicate diverse performance among the fully
connected layers, with FC6 obtaining the highest score. Temporal
refinement slightly but consistently increases the mAP in all layers.

These preliminary experiments were then extended to the fusion of the
three neural codes (FC6, FC7 and FC8) of the fine-tuned network,
complemented with the features from the original CaffeNet and finally
temporally refined. The results shown in Table 2 indicate a higher
impact of temporal refinement than in the case of single layers (from
0.7038 to 0.7357), and an unexpected gain from adding the raw neural
codes from CaffeNet (from 0.6919 to 0.7038).

Our experiments with the additional data downloaded from Flickr were
unsuccessful. The selected dataset was the 9k Flickr set with a
restrictive threshold of 0.9 on the temporal score. With this
procedure we selected 5,492 images, which were added as training
samples for fine-tuning. We measured the impact of adding this data
to training only on the softmax classifier at the last layer of
CaffeNet, and obtained a drop in the mAP from 0.5821 to 0.4547 when
adding the additional images to the already fine-tuned network. We
hypothesize that the visual nature of the images downloaded from
Flickr differs from that of the data crawled from Google and Bing by
the creators of the ChaLearn dataset. A visual inspection of the
augmented dataset did not provide any hints that could explain this
behaviour.

6.4. Results on the test dataset 

The best configuration obtained with the validation dataset was used
on the test dataset to participate in the ChaLearn 2015 challenge. Our
submission was scored by the organizers with a mAP of 0.767, the
second best performance among the seven teams which completed the
submission, out of the 42 participants who had initially registered on
the challenge website.
7. Conclusions

The presented work demonstrates the high potential of visual
information for cultural event recognition. This result is especially
striking when contrasted with many of the conclusions made in the
MediaEval Social Event Detection task, where it was frequently
observed that visual information was less reliable than contextual
metadata for event clustering. This difference may be caused by the
very salient and distinctive visual features that often make cultural
events attractive and unique. Examples include the dominant green in
Saint Patrick’s parades, the vivid colors of the Holi Festivals or the
skull icons of the Dia de los Muertos.

In our experiments the temporal refinement has provided a modest gain.
We think this may be caused by the low portion of images with
available EXIF metadata, around 24% according to our estimations. In
addition, we were also surprised by the loss introduced by the Flickr
data augmentation. We plan to look at this problem more closely and
figure out the difference between the ChaLearn dataset and ours.

Finally, it must be noted that the quantitative values around 0.7 may
be misleading, as in this dataset every image belonged to one of the
50 cultural events. Further editions of the ChaLearn challenge may
also introduce a no-event class, as in MediaEval SED 2013, in order to
better reproduce a realistic scenario where the event retrieval is
performed in the wild.